1.  INTRODUCTION

Automated tracking of moving objects in video in real time is important for applications such as video surveillance and activity
recognition. Existing visual tracking algorithms [8,11,12,13,21,22,23,24] do not adapt automatically to changes in lighting conditions,
background, sensor type (e.g., EO vs. IR), or sensor dynamics (zooming, panning, etc.). They also do not gracefully handle data that
simultaneously contains different types of motion, such as slow and fast moving objects or motion behind an occlusion. Many
existing tracking algorithms [8,12] cannot start the tracking process automatically; they require a user to draw a box around the object to
be tracked before the process can begin.

We present an agile framework for automated tracking of moving objects in full motion video (FMV). The framework is robust:
it tracks multiple foreground objects of different types (e.g., person, vehicle) with disparate motion characteristics (such as speed and
uniformity) simultaneously and in real time, under changing lighting conditions, changing background, and disparate camera dynamics. It
starts tracks automatically using a spatio-temporal filtering algorithm and gracefully handles objects in occluded surroundings.
Unlike existing tracking algorithms [12], with high likelihood it does not lose or switch tracks while following multiple similar, closely spaced
objects. The framework is based on an ensemble of tracking algorithms that are switched automatically, according to a performance measure
and without losing state, for optimal performance. Only one algorithm, the one that provides the best performance in a particular state, is
active at any time, which provides computational advantages over existing ensemble frameworks such as boosting. We prove theoretically
(Lemmas 1 and 2) that the presented agile tracking framework is more accurate than existing individual and ensemble-based algorithms.
A spatial classification algorithm based on blob size and aspect ratio allows our framework to distinguish vehicles from humans. A C++
implementation of the framework (for the purposes of this paper, the ensemble consists of three algorithms: Gaussian Mixture Background
Subtraction (GM), a color histogram approach, and optical flow) has outperformed existing visual tracking algorithms on most videos in the
Video Image Retrieval and Analysis Tool (VIRAT: www.viratdata.org) and the Tracking-Learning-Detection [12] data sets.

1.1.  RELATED  WORK 

A new particle filter, the Kernel Particle Filter (KPF), was proposed in [16] for visual tracking of multiple objects in image sequences. The
idea proposed in [17] performs tracking using a single classification SVM. A boosting-based approach proposed in [20] uses a cascade of
classifiers for object detection. However, it does not address the problem of tracking objects through consecutive frames of a video sequence.

A spatio-temporal tracking algorithm proposed in [11] tracks articulated objects in image sequences through self-occlusions
and changes in viewpoint. However, it does not provide automatic track starting or multi-object tracking. Also,
unlike our framework, the approach in [11] does not adapt to changing environmental conditions or data distributions through agile
dynamic switching of trackers based on a performance measure. The work in [13] combines background subtraction, feature tracking, and
grouping algorithms. However, it does not include a classification method based on the spatial features of the detected objects.

Among existing tracking frameworks, the one most relevant to our work is the TLD algorithm proposed in [12]. However, TLD
cannot start tracks automatically and lacks multi-object tracking. Also, TLD is based on
template matching and hence fails for videos with multiple similar-looking objects (e.g., the Indian driving scene video, Figure 4).


The approach proposed in [22] uses color histograms as the only feature, with a cascade composition of a particle filter and mean
shift. Unlike our framework, the approach of [22] does not adaptively switch between multiple trackers at runtime based on a performance
measure. It is also limited to two fixed algorithms (particle filter and mean shift), whereas in our framework the ensemble
can consist of a plurality of algorithms, providing more flexibility. For example, the embodiment of the framework used for the
experiments in Section 4 combines Gaussian mixture background subtraction, a color histogram and optical flow, with stateful switching
between the trackers in the ensemble. The method proposed in [23] uses an incremental update function to learn the object model. It uses
principal component analysis to update the sample mean and applies a forgetting factor to older observations. In contrast, we use a spatial
classification algorithm based on blob size and aspect ratio, which allows our framework to distinguish vehicles from humans. The work in [23]
does not state whether tracks are started manually or automatically. The method proposed in [24] is similar to TLD. The
difference is that it uses multiple instances as the positive examples in each frame. However, like TLD, it lacks the
ability to start tracks automatically, since marking the initial location of the object is a prerequisite.

An approach to multi-target object detection is proposed in [30], while [28, 31] describe approaches to target tracking based on
Markov models and gait recognition, respectively. Another method, for detecting event sequences in surveillance videos, is proposed in [29], but
it is applicable only to videos with very low frame rates.

None of these approaches is based on stateful dynamic switching between the trackers of an ensemble according to a performance measure.
Our tracking architecture is also parallelized, resulting in an efficient implementation for real-time visual tracking.

2.  THE  PROPOSED  APPROACH 


Figure  1:  Schematic  representation  of  our  approach 


Figure  1  shows  the  schematic  of  our  approach.  First,  a  moving  object  must  be  automatically  identified  as  part  of  the  foreground.  This  involves 
starting  tracks  at  particular  pixels  on  the  subsequent  frames  that  have  a  higher  probability  of  being  part  of  the  moving  foreground  object.  This  is 
achieved  by  1)  stabilizing  the  image  and  2)  feeding  the  stabilized  image  to  the  spatial  and  temporal  filtering  algorithms  described  below.  Once 
the  track  starter  algorithm  has  precisely  marked  the  object  coordinates,  the  objects  must  be  tracked  if  any  motion  is  to  be  identified.  Issues  such 
as  camera  instability  (shaking,  panning,  rotating)  come  into  play  and  require  image  stabilization  for  the  tracking  to  be  successful. 


2.1.  IMAGE  STABILIZATION 


In  nearly  all  Full  Motion  Video  (FMV),  there  is  at  least  slight  camera  motion.  Aerial  videos  in  particular  typically  contain  jitter  as  well  as 
significant  rotational  and  translational  camera  motion.  For  good  quality  tracking,  a  tracker  must  be  robust  to  this  significant  camera  motion.  If 
the  camera  moves  even  slightly,  the  GM  background  subtraction  algorithm,  an  algorithm  used  to  detect  motion  and  determine  the  target 
object’s  location,  will  incorrectly  detect  stationary  objects  as  moving. 

In  order  to  stabilize  an  incoming  streaming  video,  we  use  the  following  iterative  algorithm  which  attempts  to  hold  each  background 
pixel  in  the  same  position  regardless  of  lateral  and  rotational  camera  motion: 

1  Apply Shi and Tomasi's [4] feature-detection algorithm to the first frame to identify significant feature points in the image.

2  For each subsequent frame, apply Lucas-Kanade optical flow [1] to track the motion of the features identified by Shi and Tomasi's algorithm, refreshing the feature points when necessary.

3  With increasing precision for each iteration:

a  For each angle of rotation θ in a certain range, determine the translation (x, y) of each point,

b  Find the most common (mode) rotation/translation pairs (θ, x) and (θ, y) over all the features.

4  Warp the image to adjust for the total mode of the translational and rotational motion.

Before we can adjust for background motion, we must identify features of the frame; to do so, we use the Shi-Tomasi method [4].
The Shi-Tomasi method detects features such as corners and edges by approximating the weighted sum of squared differences between an image
patch and the same patch shifted by (x, y). The approximation results in the vector (x, y) multiplied by the structure tensor, which has two
eigenvalues $\lambda_1$ and $\lambda_2$; if either or both are large and positive, an edge or corner is found.
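The eigenvalue test can be illustrated with a short Python sketch. This is not the authors' C++ implementation: the unweighted window, the function names and the single `threshold` parameter are assumptions made here for illustration, and the corner/edge split follows the usual reading (both eigenvalues large for a corner, one for an edge).

```python
import math


def structure_tensor_eigenvalues(gradients):
    """Return (lambda_min, lambda_max) of the 2x2 structure tensor of a patch.

    `gradients` is a list of (Ix, Iy) image-gradient pairs, one per pixel
    in the patch. All pixels are weighted equally (an assumption).
    """
    a = sum(ix * ix for ix, _ in gradients)    # sum of Ix^2
    b = sum(ix * iy for ix, iy in gradients)   # sum of Ix * Iy
    c = sum(iy * iy for _, iy in gradients)    # sum of Iy^2
    half_trace = (a + c) / 2.0
    root = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return half_trace - root, half_trace + root


def classify_patch(gradients, threshold):
    """Label a patch as 'corner', 'edge' or 'flat' from its eigenvalues."""
    lam_min, lam_max = structure_tensor_eigenvalues(gradients)
    if lam_min > threshold:      # both eigenvalues large
        return "corner"
    if lam_max > threshold:      # only one eigenvalue large
        return "edge"
    return "flat"
```

A patch whose gradients all point in one direction yields one large eigenvalue and is labelled an edge, whereas gradients in two independent directions make both eigenvalues large.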

Next, we apply a pyramidal Lucas-Kanade method to determine the optical flow at each point of interest. We then find the mode of
the resulting flow value pairs, including rotation, by placing the pairs in bins. At each iteration the bin widths are decreased, yielding an
increasingly accurate estimate of the motion. The image is then adjusted to account for the determined background movement. When the image
is stabilized in this manner, not only are fewer false foreground objects detected, but the correct coordinates of objects are also maintained.
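The coarse-to-fine binning can be sketched as follows. This is a simplified illustration under stated assumptions: it handles translation only (the rotation angle is omitted), and the initial bin width, the halving schedule and the function name are hypothetical choices, not values from the paper.

```python
from collections import Counter


def coarse_to_fine_mode(flows, bin_width=8.0, iterations=4):
    """Estimate the dominant (dx, dy) flow by repeated histogram binning.

    `flows` is a list of (dx, dy) flow vectors, one per tracked feature.
    Each iteration keeps only the vectors in the most populated bin and
    then halves the bin width.
    """
    candidates = list(flows)
    estimate = (0.0, 0.0)
    for _ in range(iterations):
        bins = Counter(
            (int(dx // bin_width), int(dy // bin_width)) for dx, dy in candidates
        )
        (bx, by), _count = bins.most_common(1)[0]
        # Keep only the vectors that fell into the winning bin.
        candidates = [
            (dx, dy)
            for dx, dy in candidates
            if int(dx // bin_width) == bx and int(dy // bin_width) == by
        ]
        # Report the centre of the winning bin as the current estimate.
        estimate = ((bx + 0.5) * bin_width, (by + 0.5) * bin_width)
        bin_width /= 2.0
    return estimate
```

Because foreground features usually form a minority, their flow vectors fall outside the winning bin in the first iteration and do not influence the refined estimate.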

If a stabilization failure is detected, indicated by Lucas-Kanade (LK) flow having many points with a large mean-square error distance
(probably due to video corruption or to a perspective motion for which we do not compensate), the stabilization transforms of nearby frames
are interpolated. GM background subtraction is then considered unreliable, so LK flow and the color histogram model are used exclusively for these frames.

At  present,  our  method  stabilizes  the  videos  for  small  amounts  of  translational  and  rotational  camera  movement.  Thus,  for  wide 
camera  sweeps  or  changes  in  perspective  or  scale,  our  stabilization  method  is  not  appropriate.  Scale  compensation,  however,  may  be  integrated 
similarly  to  rotation. 


2.2.  TRACK  STARTING 


The automated track-starting algorithm, which is based on confidence-based spatio-temporal filtering, first detects blobs using the GM
Background Subtraction method [9]. This yields difference images, which are fed into the spatial filtering module below.

2.2.1.  OPENING  OR  CLOSING  OF  IMAGES  VIA  IMAGE  MORPHING 

The  image  obtained  through  the  background  subtraction  algorithm  is  initially  opened  by  a  structuring  element  with  diameter  3  pixels  to  filter 
out  unnecessary  noise.  By  opening,  we  mean  the  dilation  of  the  erosion  of  a  set  A  by  a  structuring  element  B.  Then  it  is  closed  with  k-means 
clustering  [2].  This  helps  in  detecting  blobs  over  subsequent  frames. 
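For readers unfamiliar with the morphological operators, the following Python sketch shows opening and closing on a binary image stored as a list of rows. It assumes a square (2k+1) x (2k+1) structuring element (k = 1 gives the 3-pixel diameter mentioned above) and treats pixels outside the image as background; both are illustrative assumptions, and the k-means step of [2] is not modelled.

```python
def erode(img, k=1):
    """A pixel survives only if every pixel under the element is set."""
    h, w = len(img), len(img[0])
    return [
        [
            int(all(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in range(-k, k + 1)
                for dx in range(-k, k + 1)
            ))
            for x in range(w)
        ]
        for y in range(h)
    ]


def dilate(img, k=1):
    """A pixel is set if any pixel under the element is set."""
    h, w = len(img), len(img[0])
    return [
        [
            int(any(
                0 <= y + dy < h and 0 <= x + dx < w and img[y + dy][x + dx]
                for dy in range(-k, k + 1)
                for dx in range(-k, k + 1)
            ))
            for x in range(w)
        ]
        for y in range(h)
    ]


def opening(img, k=1):
    """Dilation of the erosion: removes specks smaller than the element."""
    return dilate(erode(img, k), k)


def closing(img, k=1):
    """Erosion of the dilation: fills holes smaller than the element."""
    return erode(dilate(img, k), k)
```

Opening removes isolated noise pixels while leaving blobs at least as large as the element intact; closing fills small holes inside a blob.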

2.2.2.  SPATIAL  FILTERING 

Once  blobs  are  detected  in  the  difference  images,  they  are  filtered  according  to  their  spatial  features.  The  pseudo  code  for  the  spatial  filtering 
algorithm  is  provided  below.  Scale  information  available  from  the  metadata  accompanying  the  videos  is  used  to  filter  blobs  specifically  based 
on  their  area  and  orientation.  The  filtered  blobs  are  then  passed  as  input  to  the  temporal  filtering  algorithm  below. 

2.2.3.  TEMPORAL  FILTERING 


To filter blobs in the temporal domain we use a confidence measure. Each blob has a confidence measure $\delta$ associated with it.

Initially the confidence value for each blob is zero. The confidence value for a blob increases as it is detected across successive frames: if a
blob appears in consecutive frames, its confidence value increases according to a prior confidence measure. The confidence update equations
are as follows.

Equation for confidence gain:

$\delta = 0.5^{-n}$ ...(1)

Equation for confidence loss:

$\delta = -0.5^{-n}$ ...(2)

where n is the frame number.

The composite confidence update equation is as follows:

$\delta = (0.5^{-n}) \lor (-0.5^{-n})$ ...(3)

where 0.5 indicates the increase in confidence, -0.5 the decrease in confidence, and n is the frame number.

The exponent -n denotes that, as the frame number increases, the magnitude of the confidence keeps increasing in either the positive or the
negative direction. The confidence update equation takes the form portrayed in Figure 2.


Figure  2:  Confidence  value  update  for  the  frames  (for  increasing  confidence). 

2.2.4.  ADAPTIVE  THRESHOLDING 

If the confidence value for a blob exceeds a specified upper threshold $\sigma$, a track is started on it, at the pixel representing the object
coordinates. The moment the confidence value for a blob falls beneath a lower threshold $\tau$, the corresponding object is discarded. If the
confidence value is between $\tau$ and $\sigma$, the corresponding blob is maintained in the list of prospective tracks. For videos that have
higher noise, clutter and random changes in lighting conditions, as is often the case for outdoor videos taken from moving cameras, the upper
threshold $\sigma$ is set higher. On the other hand, for videos with more stable conditions $\sigma$ is set lower because of the lower
probability of encountering random classification noise.
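The interplay of the confidence update and the two thresholds can be sketched in Python. Several points are assumptions rather than statements from the paper: the update is taken to be additive (the confidence is incremented or decremented by 0.5^(-n)), n is taken to count frames since the blob was first seen, and the names `update_confidence`, `decide`, `sigma` and `tau` are illustrative.

```python
def update_confidence(delta, n, detected):
    """Return the new confidence after frame n.

    The step 0.5 ** (-n) equals 2 ** n, so later frames weigh more.
    Assumption: the step is added on a detection and subtracted on a miss.
    """
    step = 0.5 ** (-n)
    return delta + step if detected else delta - step


def decide(delta, sigma, tau):
    """Map a confidence value to an action using upper/lower thresholds."""
    if delta > sigma:
        return "start_track"       # confident enough to start a track
    if delta < tau:
        return "discard"           # fell below the lower threshold
    return "keep_candidate"        # stays in the list of prospective tracks
```

With `sigma = 10` and `tau = -4`, a blob seen in three consecutive frames accumulates 2 + 4 + 8 = 14 and starts a track, while a single early miss leaves it a candidate.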


The track-starting algorithm:

begin:
    img <- getFrame(video);
    img <- STABILIZE_IMAGE(img);
    bw_img <- GM_BACKGROUND_SUBTRACTION(img);
    si <- create_structuring_element(3);        /* here 3 is the diameter of the structuring element */
    img <- PERFORM_OPEN_ON_IMAGE(bw_img, si);   /* performs morphological opening on the image */
    si <- create_structuring_element(n);        /* n is chosen adaptively acc. to the image */
    img <- PERFORM_CLOSE_ON_IMAGE(img, si);     /* performs morphological closing on the image */
    contour_img <- FIND_CONTOUR(img);           /* finds the boundaries on the image */
    count = 0;
    while (contour != NULL)
        /* GET_OBJ_FROM_CONTOUR finds each element from the list of contours; prob_obj contains probable objects */
        prob_obj[count] <- GET_OBJ_FROM_CONTOUR(contour_img);
        count <- count + 1;
    end while
    for i <- 0 to count
        temp[i] <- SPATIALFILTERING(prob_obj[i]);
    end for
    while temp != NULL
        obj <- TEMPORALFILTERING(temp);
    end while
end

SPATIALFILTERING(prob_obj)
begin:
    /* Here τ1, τ2, τ3 and τ4 indicate the respective thresholds */
    if (prob_obj.size < τ1 AND prob_obj.size > τ2 AND prob_obj.height/prob_obj.width < τ3 AND prob_obj.height/prob_obj.width > τ4)
        return prob_obj;
    else
        return NULL;
    endif
end

TEMPORALFILTERING(temp)
begin:
    for each prob_obj
        δ_prob_obj <- 0;                          /* initialize weight of each object detected as 0 */
    end for
    /* confidence update equations */
    if for video.nextframe obj_detected = prob_obj
        δ_prob_obj <- δ_prob_obj + (0.5)^(-n);
    else
        δ_prob_obj <- δ_prob_obj - (0.5)^(-n);
    end if
    if δ_prob_obj < τ
        remove prob_obj from list of objects;
    else
        obj <- obj ⊕ prob_obj;                    /* append prob_obj to the list of objects detected; ⊕ represents the append operator */
    end if
    for each obj, if δ_prob_obj > σ
        start tracks on obj(x, y);                /* start tracks on the pixel (x, y) representing the centroids of objects */
    end for
    return obj;
end
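The spatial filtering step above can be written as a small runnable Python sketch. The `Blob` record and the parameter names are hypothetical; the paper does not give threshold values, so none are fixed here.

```python
from dataclasses import dataclass


@dataclass
class Blob:
    size: float     # blob area in pixels
    width: float    # bounding-box width
    height: float   # bounding-box height


def spatial_filter(blobs, max_size, min_size, max_aspect, min_aspect):
    """Keep blobs whose area and height/width ratio lie inside the limits.

    The four limits play the role of the thresholds tau1..tau4 in the
    pseudocode; blobs with zero width are skipped to avoid division by zero.
    """
    kept = []
    for blob in blobs:
        if blob.width <= 0:
            continue
        aspect = blob.height / blob.width
        if min_size < blob.size < max_size and min_aspect < aspect < max_aspect:
            kept.append(blob)
    return kept
```

Returning a filtered list instead of NULL for rejected blobs keeps the caller free of per-element null checks; this is a design choice of the sketch, not of the original algorithm.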


2.3.  THE  AGILE  TRACKING  FRAMEWORK 

Object tracking is a matter of determining the apparent motion of the target object and keeping track of its pixel coordinates. Many object
tracking methods are based on optical flow. The fundamental assumption of any method used to compute optical flow is that the intensity of the
target object moves with constant velocity across frames. Existing methods such as the Kalman filter [8], based on a Bayesian model, and TLD [12],
based on template matching, primarily use a single learner to perform the underlying computations. In statistics and machine learning, ensemble
methods use multiple models to obtain better predictive performance than could be obtained from any of the constituent models [3,7,10]. It can
be shown through the following lemma that an ensemble learner performs better than any of the constituent learners.

Lemma 1. Even a strong learner cannot endure situational variances, i.e., it cannot perform well in all situations.

Proof. The boosting algorithm described by Schapire, and subsequently proposed implementations such as AdaBoost, are convex potential
boosters. As shown in [19], for a wide range of convex potential functions, any such boosting algorithm is susceptible to random classification
noise: it is able to classify examples correctly in the absence of noise, but in the presence of noise the
learner cannot learn to an accuracy better than 1/2. This holds even if the boosting algorithm stops early or the voting weights are bounded.

Consider two disjoint concept classes $C_1$ and $C_2$ such that $C_1 \cap C_2 = \emptyset$. Now, if we consider an instance space $X$ containing
elements from $C_1$, then any element of $C_2$ can be classified as random noise in $X$. So, effectively, at least two different learners $L_1$
and $L_2$ are needed for classifying the instances in $X$ according to $C_1$ and $C_2$.


In light of the above lemma, we present a new agile learning-based tracking framework that dynamically switches between an ensemble of
classifiers based on a performance measure, while preserving state, to deal with unforeseen situational variances. An embodiment of the
framework, with which the experiments in Section 4 were conducted, uses a combination of three methods for tracking object motion: Gaussian
Mixture (GM) background subtraction [9] with mean-shift, Lucas-Kanade (LK) optical flow [1], and a color-histogram approach [32] that also
utilizes mean-shift. A combination of these algorithms allows our tracker to track fast, slow, stopped, and partially occluded objects. By an
agile learning-based tracker, we mean that our tracker can switch adaptively between the constituent learners at runtime, based on
velocities and certain measures of track quality, while preserving state. Lemma 2 below shows that dynamic switching between the learners at
runtime yields more accurate results.

The new agile tracking framework uses an ensemble of k individual trackers. It allows adaptive switching between the constituent
trackers dynamically based on a performance measure. The algorithm for adaptive switching is described below.

The switching algorithm:

SWITCH():
    j <- 1;
    active_tracker <- T_j;                              // Note: T_j is the jth tracker
    compute the performance measure X
    if X > threshold
        CHECKPOINT_CURRENT_STATE();                     // saves the current state
        active_tracker <- CALL_TRACKER_SELECTOR();      // calls a new tracker
        state <- GET_CHECKPOINTED_STATE();              // returns the currently checkpointed state
        state <- active_tracker(state);
    else
        continue;
    endif
    if performance measure X is minimized
        j <- j + 1
    endif

The switching module is called by the agile tracking algorithm below:

The tracking algorithm:

AGILETRACKER(freq):
    for each frame i,
        if frame number % freq = 0
            call SWITCH();
        endif
    endfor
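A minimal Python sketch of stateful switching is given below. It is an illustration under assumptions, not the authors' implementation: trackers are modelled as functions from state to state, the performance measure and the selector are passed in as callables, the active tracker is run on every frame, and the class and attribute names are hypothetical.

```python
import copy


class AgileTracker:
    """Runs one active tracker at a time and switches when quality drops."""

    def __init__(self, trackers, measure, selector, threshold, freq):
        self.trackers = trackers    # list of callables: state -> state
        self.measure = measure      # callable: state -> float (higher is worse)
        self.selector = selector    # callable: (trackers, state) -> index
        self.threshold = threshold
        self.freq = freq            # check the measure every `freq` frames
        self.active = 0
        self.checkpoint = None

    def switch(self, state):
        """Replace the active tracker if the measure exceeds the threshold."""
        if self.measure(state) > self.threshold:
            self.checkpoint = copy.deepcopy(state)              # save current state
            self.active = self.selector(self.trackers, state)   # pick a new tracker
            state = copy.deepcopy(self.checkpoint)              # hand state over
        return state

    def run(self, state, num_frames):
        """Track for `num_frames` frames, checking for a switch periodically."""
        for frame in range(num_frames):
            if frame % self.freq == 0:
                state = self.switch(state)
            state = self.trackers[self.active](state)
        return state
```

The point of the checkpoint is that the newly selected tracker continues from the position reached by the previous one instead of re-initializing the track.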


In the above, state refers to the set of tuples (x, y, n, I), where x and y are the pixel coordinates, n is the frame number and I is the
intensity. The agile tracker calls the switching algorithm at a user-specified frequency. The switching algorithm computes the performance
measure at the current state. If it exceeds a threshold, the current tracker is substituted with a new one obtained from the ensemble through
a pre-defined policy, chosen so that applying the new tracker to the current state results in a state whose performance measure
value is below the threshold. While switching, the current state is checkpointed so that it can be accessed by the new tracker. For the current
embodiment of the framework, we use the linear function given below as the performance measure:

P = k1 * (stabilization error) + k2 * (track overlap amount) + k3 * (probability jump detected) + k4 * (probability drift detected) + k5 *
(high track speed) + k6 * (low track speed)

where k1, k2, ..., k6 are constants whose sum is 1 and whose values depend on the constituent trackers in the ensemble. Drift is defined as a lack
of movement of the track while there is foreground motion present which would cause the track to continue to move.

GM background subtraction performs poorly during stabilization failure, moderately well during track overlaps (when combined with the object
passing algorithm described in 2.3.4), sometimes jumps to background noise, tends not to drift, and works best for fast moving objects. Thus,
for GM, k1 is large, k2 is slightly smaller, k3 is large, k4 and k5 are 0, and k6 is large. For simplicity, assume k1=k3=k6=0.3, k2=0.1, k4=k5=0.

The  color  histogram  tracker  performs  well  during  stabilization  failures,  moderately  well  during  track  overlaps,  rarely  jumps, 
occasionally  drifts,  and  performs  well  for  fast  or  slow  moving  objects,  though  is  especially  good  for  slow  or  stopped  objects.  For  this  tracker, 
reasonable  parameters  are  k1=k6=0,  k2=k3=0.25,  k4=0.4,  k5=0.1. 


LK  flow  performs  quite  well  during  stabilization  failure,  very  well  during  object  passing,  typically  does  not  jump,  though  often  drifts, 
and  performs  best  for  slow,  but  not  stopped,  objects.  Assuming  high  track  speed  and  low  track  speed  are  both  small  for  objects  moving  at 
such  a  moderate  speed,  reasonable  values  are  k1=k2=k3=0,  k4=0.5,  k5=k6=0.25. 

Parameters  such  as  k4-k6  may  be  determined  experimentally,  since  the  ability  to  track  at  specific  speeds  is  not  prior  knowledge,  and 
may  also  vary  based  on  the  type  of  video  to  be  tracked.  The  performance  measure  quantifies  the  tracking  error  at  the  current  state.  If  more 
information  regarding  the  video  characteristics  is  known,  it  may  be  beneficial  to  experimentally  adjust  the  performance  measure  based  on  those 
characteristics. 
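The per-tracker weights quoted above can be combined into a short Python sketch of the performance measure. The weight values are taken from the text; the factor names, the dictionary layout and the selection policy (pick the tracker with the lowest weighted error) are assumptions made for illustration, since the paper only states that a pre-defined policy is used.

```python
FACTORS = (
    "stabilization_error",
    "track_overlap",
    "jump_prob",
    "drift_prob",
    "high_speed",
    "low_speed",
)

# Weights (k1..k6) per tracker, as given in the text; each row sums to 1.
WEIGHTS = {
    "gm": (0.3, 0.1, 0.3, 0.0, 0.0, 0.3),
    "color_histogram": (0.0, 0.25, 0.25, 0.4, 0.1, 0.0),
    "lk_flow": (0.0, 0.0, 0.0, 0.5, 0.25, 0.25),
}


def performance_measure(weights, observations):
    """Linear measure P: weighted sum of the observed error factors."""
    return sum(k * observations[name] for k, name in zip(weights, FACTORS))


def select_tracker(observations):
    """Hypothetical policy: choose the tracker with the lowest P."""
    return min(
        WEIGHTS,
        key=lambda name: performance_measure(WEIGHTS[name], observations),
    )
```

For a fast-moving object (high_speed = 1, all other factors 0) the measures are 0 for GM, 0.1 for the color histogram and 0.25 for LK flow, so this policy selects GM, in line with the qualitative description above.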

The  next  lemma  shows  that  dynamic  switching  between  individual  trackers  yields  more  accurate  results. 


Lemma  2.  Switching  between  individual  trackers  dynamically  can  decrease  the  upper  bound  for  error  up  to  a  certain  pre-defined 
value. 

Proof. Suppose c(v) is the correct classification for v and $h_1(v)$, $h_2(v)$, etc. are the classifications produced by the trackers $T_1$, $T_2$, etc.,
respectively. h(v) is the estimate produced by the effective composite tracker T.

Here, $T = T_1 \wedge T_2 \wedge \dots \wedge T_n$, where $T_1$, $T_2$, etc. indicate the trackers and $\wedge$ indicates the switch operator on the trackers.

Also, let $\alpha_1$, $\alpha_2$, etc. be the respective probabilities of error or misclassification. For switching between trackers dynamically at
runtime, we incorporate the idea of defining adaptive thresholds $\tau_1$, $\tau_2$, etc. So, we define the set
$\tau = \{\tau_1, \tau_2, \tau_3, \tau_4, \tau_5, \dots, \tau_n\}$ as the thresholds for the number of misclassifications. If the number of
misclassifications for a particular tracker $T_i$ exceeds the corresponding threshold $\tau_i$, we switch the learner.

Suppose for the ith tracker, the number of misclassifications becomes $(\tau_i + 1)$ at the $(n_i + 1)$th instance. So, up to the $n_i$th instance,
the probability of error or misclassification is

$\Pr(h_i(v) \neq c(v)) = \left(\frac{\tau_i}{n_i}\right) \times \alpha_i$ ...(4)

Also, let $\epsilon$ be the upper bound of error on any of the individual trackers. Hence, for the total tracking process, the composite probability of
misclassification is given by

$\Pr(h(v) \neq c(v)) = \Pr((h_1(v) \neq c(v)) \wedge (h_2(v) \neq c(v)) \wedge \dots \wedge (h_N(v) \neq c(v)))$

$= \left(\frac{\tau_1}{n_1}\right) \times \alpha_1^{\tau_1} \times \left(\frac{\tau_2}{n_2}\right) \times \alpha_2^{\tau_2} \times \left(\frac{\tau_3}{n_3}\right) \times \alpha_3^{\tau_3} \times \dots \times \left(\frac{\tau_N}{n_N}\right) \times \alpha_N^{\tau_N}$

$< \left(\frac{\tau_1}{n_1}\right) \times \epsilon^{\tau_1} \times \left(\frac{\tau_2}{n_2}\right) \times \epsilon^{\tau_2} \times \left(\frac{\tau_3}{n_3}\right) \times \epsilon^{\tau_3} \times \dots \times \left(\frac{\tau_N}{n_N}\right) \times \epsilon^{\tau_N}$   [since, for all i, $\alpha_i < \epsilon$]

$< \epsilon^{(\tau_1 + \tau_2 + \tau_3 + \dots + \tau_N)}$ ...(5)

Here, N is the number of switches performed at runtime.

Observations:

1)  Inequation (5) holds because each of the terms $\frac{\tau_i}{n_i} < 1$, as well as $\epsilon^{(\tau_1 + \tau_2 + \tau_3 + \dots + \tau_N)} < \epsilon$ since $\epsilon < 1$.

2)  So, the overall upper bound for the error of the composite tracker is reduced owing to switching at runtime.

3)  Inequation (5) proves that the effective composite error bound of the agile tracker T is less than that of the individual trackers $T_i$.

Observations 1, 2 and 3 justify our argument that using switching reduces the overall error bound.

Threshold  value  selection  is  a  very  important  criterion  in  optimizing  the  agile  tracker.  In  order  to  evaluate  the  threshold  selection 
criteria,  let  us  concentrate  on  the  simplified  version  of  the  equation  presented  in  (5). 

So,  we  have,  Classification  error 

=  Pr(h(v)*c(v))<  n  =,(^)  Ti  =  n  Ti  ...(6) 

The  error  bound  can  be  minimized  by  increasing  Tj  until  X[  =  PV2-I  • 


In  a  typical  video  scenario,  most  features  are  stationary  from  frame  to  frame  with  only  a  few  objects  moving.  The  stationary  features 
are  considered  to  be  in  the  background,  and  the  moving  objects  are  foreground.  The  GM  background  subtraction  method  described  in  [9] 
efficiently  segments  foreground  and  background  objects  in  real  time,  allowing  for  effective  object  tracking  with  the  mean-shift  algorithm. 
However,  as  is  typical  of  background  segmentation  methods,  it  becomes  less  effective  when  there  is  uncompensated  camera  instability.  Even 
with  a  stable  camera,  this  method  tends  to  lose  foreground  objects  if  there  is  relatively  small  movement  in  the  foreground.  To  compensate  for 
these  deficiencies,  we  also  use  a  more  traditional  and  robust  optical  flow  method  for  object  tracking. 

The Lucas-Kanade (LK) method, like many algorithms used to compute optical flow, imposes a constraint on the optical flow
problem: the displacement $(\delta x, \delta y)$ of the image intensity from a pixel (x, y) to a pixel $(x+\delta x, y+\delta y)$ in the subsequent
frame is small and constant over time. That is, it must satisfy for all pixels p the equation:

$I_x(p) V_x + I_y(p) V_y = -I_t(p)$ ...(7)

where $I_x$, $I_y$ and $I_t$ are the partial derivatives of the image intensity with respect to x, y and t, and $V_x$ and $V_y$ are the components of
the velocity vector. Applying this equation to every pixel in a window usually results in an over-determined system, which is solved by least
squares. Due to the constraint imposed by the method, it is best suited for an object moving slowly with constant velocity. We use pyramidal LK.
That is, we compute LK on the lowest-resolution image $I_0$; then, having obtained this lower-resolution result, we compute LK incrementally for
the next lowest resolution $I_1$. Similarly, we obtain the result for $I_2$ from $I_1$, and so forth until reaching the full resolution.
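The least-squares solution of the over-determined system built from equation (7) can be sketched in plain Python for a single window. The function name and the singularity tolerance are assumptions; a pyramidal implementation would repeat this solve at each resolution level.

```python
def lucas_kanade_velocity(derivatives):
    """Solve Ix*Vx + Iy*Vy = -It in the least-squares sense.

    `derivatives` is a list of (Ix, Iy, It) triples, one per pixel in the
    window. Returns (Vx, Vy) from the 2x2 normal equations.
    """
    sxx = sum(ix * ix for ix, _, _ in derivatives)
    sxy = sum(ix * iy for ix, iy, _ in derivatives)
    syy = sum(iy * iy for _, iy, _ in derivatives)
    sxt = sum(ix * it for ix, _, it in derivatives)
    syt = sum(iy * it for _, iy, it in derivatives)
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        # Gradients all point the same way: the flow is not uniquely defined.
        raise ValueError("window has no two-dimensional texture")
    vx = (-syy * sxt + sxy * syt) / det
    vy = (sxy * sxt - sxx * syt) / det
    return vx, vy
```

The determinant check mirrors the feature-selection step: windows without gradients in two independent directions cannot yield a reliable velocity.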

Combined, the LK method and GM background tracking provide motion-tracking performance superior to that of either method used
alone. However, neither method performs well on objects which are stopped; GM tends to jump to nearby moving objects and noise while LK
drifts significantly over time. To track these objects, we introduce a third method based on a color histogram of the object being tracked. We
create a model of the object based on the frequency of red, green and blue intensities in the foreground and background, as obtained from GM.


This model is updated slowly over time. A probability image is created, in which each pixel value corresponds to the predicted
probability of the object existing at that point. Each pixel probability is computed from the color histograms of the region of interest using the
equation $P(x,y) = P_f(x,y) \cdot P_f(x,y) / (P_f(x,y) + P_b(x,y))$, where $P_f$ is the normalized frequency of each RGB value on the foreground histogram and
$P_b$ is the frequency on the background histogram. Mean-shift is then used on the probability image to track the object by re-centering the region
of interest on the center of mass of the probability image.
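The probability image and the mean-shift re-centering can be sketched as follows. The per-pixel formula used is P = Pf * Pf / (Pf + Pb), which is our reading of the equation in the text; the square window, the integer rounding of the centre and the function names are illustrative assumptions.

```python
def pixel_probability(pf, pb):
    """Object probability from foreground/background histogram frequencies."""
    denom = pf + pb
    return pf * pf / denom if denom > 0 else 0.0


def mean_shift_step(prob, cx, cy, half):
    """Move the window centre to the centre of mass of `prob` inside it."""
    h, w = len(prob), len(prob[0])
    total = sx = sy = 0.0
    for y in range(max(0, cy - half), min(h, cy + half + 1)):
        for x in range(max(0, cx - half), min(w, cx + half + 1)):
            p = prob[y][x]
            total += p
            sx += p * x
            sy += p * y
    if total == 0:
        return cx, cy            # nothing to follow: stay in place
    return round(sx / total), round(sy / total)


def mean_shift(prob, cx, cy, half, max_iter=20):
    """Iterate re-centering until the window stops moving."""
    for _ in range(max_iter):
        nx, ny = mean_shift_step(prob, cx, cy, half)
        if (nx, ny) == (cx, cy):
            break
        cx, cy = nx, ny
    return cx, cy
```

Here `prob` is a list of rows of per-pixel probabilities and `(cx, cy)` is the current centre of the region of interest; the returned centre is where the track is placed in the current frame.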

When  used  on  full  motion  videos,  object  tracking  presents  an  array  of  challenges.  One  is  camera  instability;  often,  during  recording, 
the  camera  shakes,  pans,  or  rotates,  which  causes  background  objects  to  appear  to  move.  A  second  is  poor  image  quality  due  to  low-definition 
recording  equipment  or  long  distance;  this  obscures  images  and  interferes  with  the  tracking  process.  A  third  is  the  need  for  real-time  tracking, 
which  requires  simple,  efficient  methods  to  keep  up  with  the  pace  of  real-time  input. 

2.3.2.  AGILE  TRACKING  VS.  OTHER  ENSEMBLE  BASED  TRACKERS 

A tracker based on an ensemble machine learning technique like boosting (e.g., AdaBoost) will create, based on training data, an optimal tracker
of the form:

$T = \sum_{p=1}^{P} \alpha_p t_p$    ...(8)

where $P$ is the number of rounds, $t_p$ is a tracker in the ensemble, and $\alpha_p$ are weights such that $\sum_{p=1}^{P} \alpha_p = 1$.

When running on actual data, T must run all P trackers on each data point (i.e., frame) and compute a weighted sum of their
outputs. In our case only one tracker is active at any particular time, i.e., only one tracker is run on each data point. This is crucial for real-time
performance.
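
The difference in per-frame cost can be illustrated with a short sketch. It assumes, purely for illustration, that each tracker outputs a position estimate and that a tracker can be modeled as a callable; the names Estimate, Tracker, boostedOutput and agileOutput are hypothetical and do not come from the paper.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Assumed tracker output: an estimated object position for one frame.
struct Estimate { double x, y; };

// Assumed tracker interface: a callable evaluated once per frame index.
using Tracker = std::function<Estimate(int)>;

// Boosting-style ensemble: every one of the P trackers runs on the frame
// and the outputs are combined as a weighted sum with weights alpha.
inline Estimate boostedOutput(const std::vector<Tracker>& trackers,
                              const std::vector<double>& alpha, int frame) {
    assert(trackers.size() == alpha.size());
    Estimate out{0.0, 0.0};
    for (std::size_t p = 0; p < trackers.size(); ++p) {
        const Estimate e = trackers[p](frame);
        out.x += alpha[p] * e.x;
        out.y += alpha[p] * e.y;
    }
    return out;
}

// Agile-style ensemble: only the currently active tracker runs on the frame.
inline Estimate agileOutput(const std::vector<Tracker>& trackers,
                            std::size_t active, int frame) {
    assert(active < trackers.size());
    return trackers[active](frame);
}
```

With P trackers in the ensemble, the first function costs P tracker evaluations per frame while the second costs one, independent of P.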

Moreover, in boosting, the weights are fixed once the training is over. This can create problems if the character of the data
changes drastically from the examples on which the training was performed, due to changes in background, lighting conditions, etc. This can be
avoided in the agile framework by having multiple boosted trackers in the ensemble and switching among them using the SWITCH()
method, which increases the computational cost but yields higher performance.


2.3.3.  IMAGE  QUALITY  AND  REAL-TIME  TRACKING 

In the current embodiment of the framework, poor image quality under real-time constraints is handled primarily through the use of the
three tracking algorithms. Using all three methods gives a better result than any one alone; LK flow succeeds where GM background
subtraction fails, and vice versa. For a blurry, low-quality, fast-moving object, GM background subtraction works well as long as the image
is well stabilized. LK flow can track slower objects well, and works without stabilization information. The color histogram also works during
stabilization failure, and can track stopped objects better than the other algorithms. If a failure is detected in any algorithm, defined as the
performance measure exceeding a certain threshold, we simply switch to another algorithm and continue tracking.
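
A failure-driven switch of this kind can be sketched as below. This is not the SWITCH() method of the framework, whose selection constants are described in section 2.3; it is a simplified stand-in that assumes a fixed, hypothetical fallback order (GM, then LK, then color histogram) and a single scalar performance measure for which larger values are worse. All names are illustrative.

```cpp
#include <cassert>

// The three algorithms in the ensemble.
enum class Algo { GM, LK, ColorHistogram };

// Track state shared by all algorithms, so a switch does not lose it.
struct TrackState { double x, y, w, h; };

// Hypothetical fallback order: GM -> LK -> ColorHistogram -> GM.
inline Algo nextAlgo(Algo a) {
    switch (a) {
        case Algo::GM: return Algo::LK;
        case Algo::LK: return Algo::ColorHistogram;
        case Algo::ColorHistogram: return Algo::GM;
    }
    return Algo::GM;  // unreachable for valid enum values
}

// Failure is defined as the performance measure exceeding the threshold;
// on failure the next algorithm in the fallback order takes over.
inline Algo selectAlgorithm(Algo active, double measure, double threshold) {
    return measure > threshold ? nextAlgo(active) : active;
}

// A track that keeps its state across algorithm switches.
struct AgileTrack {
    Algo active;
    TrackState state;   // untouched by a switch
    double threshold;   // assumed failure threshold

    void onFrame(double measure) {
        active = selectAlgorithm(active, measure, threshold);
    }
};
```

Because only the active field changes on a switch, the position and size stored in state carry over to whichever algorithm runs next.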

LK  flow  is  used  for  stabilizing  the  image,  so  the  marginal  cost  to  use  it  as  a  tracking  algorithm  is  relatively  low.  Using  the  stated 
algorithms  together  is  an  efficient  choice  to  achieve  real-time  tracking. 


2.3.4.  OBJECT  PASSING 


One problem with GM background subtraction is that when two moving objects are close together or occlude one another, it becomes difficult to separate them.
Likewise, with Lucas-Kanade, the boundaries of the tracked objects must be approximately known. Even the color histogram model will not
always separate two similarly colored cars. To account for this, we create a probability image consisting of two Gaussians when two objects are nearby.
This probability image contains pixel values equal to the expected probability of the object being centered at each pixel location,
based on the expected motion of each object. The first object cannot move to where the probability is 0 (e.g., at the center of the second object),
and likewise for the second object. This, along with preventing large jumps, usually solves the problem when two objects pass close to each
other.
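
One way to build such a probability image is sketched below. The text does not give the exact combination of the two Gaussians, so this sketch assumes a product form: a Gaussian around the first object's predicted position, multiplied by one minus a unit-peak Gaussian around the second object's predicted position, which makes the value exactly 0 at the second object's center. The isotropic shape and the single shared sigma are also assumptions, and the names are illustrative.

```cpp
#include <cassert>
#include <cmath>

struct Point { double x, y; };

// Unnormalized isotropic Gaussian with peak value 1 at the mean.
inline double gaussian(Point p, Point mean, double sigma) {
    assert(sigma > 0.0);
    const double dx = p.x - mean.x;
    const double dy = p.y - mean.y;
    return std::exp(-(dx * dx + dy * dy) / (2.0 * sigma * sigma));
}

// Assumed combination: probability that object A is centered at p, given the
// predicted centers of A and of the nearby object B. The value is high near
// A's predicted center and exactly 0 at B's predicted center.
inline double passingProbability(Point p, Point predictedA, Point predictedB, double sigma) {
    return gaussian(p, predictedA, sigma) * (1.0 - gaussian(p, predictedB, sigma));
}
```

Swapping the roles of the two predicted centers gives the corresponding image for the second object.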

2.3.5.  DISTINGUISHING  BETWEEN  HUMANS  AND  CARS 

A track is classified as a human or car based on the blob size and aspect ratio. A confidence measure is built up over time, and, if necessary, a
correction may be made. If either the area or aspect ratio alone is a strong indicator of the presence of a car, then only this metric is used to
devise the classifier. Otherwise, both area and aspect ratio must be used. σcar and σhuman are initially set to zero, so that an estimate may be
obtained immediately. They should then be changed to nonzero values to prevent fluctuations between human and car detections due to
inaccurate blobs. The size and aspect ratio of the region of interest used depend on the human or car classification.

The human/car classification algorithm:

begin
    sz <- getBlobSize(track);
    if sz.width*sz.height > xcar_area1 OR (sz.width*sz.height > xcar_area2 AND sz.width/sz.height > xcar_aspect2) OR sz.width/sz.height > xcar_aspect1
        /* xcar_area1 and xcar_aspect1 are thresholds where it is almost certain that the tracked object is a car. xcar_area2 and xcar_aspect2 have a
           lower confidence */
        probableCar <- 1;
        probableHuman <- 0;
    else
        probableCar <- 0;
        probableHuman <- 1;
    endif
    carConfidence <- carConfidence + probableCar - probableHuman;
    if carConfidence > σcar
        track is a car    /* σcar is generally > 0 and σhuman < 0 */
    else if carConfidence < σhuman
        track is a human
    endif
end
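
The classification steps above can be sketched in C++ as follows. This is an illustrative sketch, not the paper's code: the blob size is passed in directly instead of being read from a track, the class and member names are hypothetical, and the threshold values used in the usage note are made up for the example. The OR/AND condition is grouped so that either strong threshold alone, or both weak thresholds together, mark a probable car.

```cpp
#include <cassert>

struct BlobSize { double width, height; };

// Area and aspect-ratio thresholds. area1/aspect1 are the strong thresholds,
// area2/aspect2 the weaker ones that must hold together.
struct CarThresholds {
    double area1, aspect1;
    double area2, aspect2;
};

enum class Label { Unknown, Car, Human };

class HumanCarClassifier {
public:
    // carLimit and humanLimit play the role of the two confidence thresholds;
    // both default to zero so that an estimate is available immediately.
    explicit HumanCarClassifier(CarThresholds t, double carLimit = 0.0, double humanLimit = 0.0)
        : t_(t), carLimit_(carLimit), humanLimit_(humanLimit) {}

    // Process one frame's blob and return the current label of the track.
    Label update(BlobSize sz) {
        assert(sz.height > 0.0);
        const double area = sz.width * sz.height;
        const double aspect = sz.width / sz.height;
        const bool probableCar = area > t_.area1 ||
                                 (area > t_.area2 && aspect > t_.aspect2) ||
                                 aspect > t_.aspect1;
        // carConfidence + probableCar - probableHuman, with exactly one flag set.
        carConfidence_ += probableCar ? 1 : -1;
        if (carConfidence_ > carLimit_) {
            label_ = Label::Car;
        } else if (carConfidence_ < humanLimit_) {
            label_ = Label::Human;
        }
        // Otherwise the previous label is kept.
        return label_;
    }

    Label label() const { return label_; }

private:
    CarThresholds t_;
    double carLimit_;
    double humanLimit_;
    int carConfidence_ = 0;
    Label label_ = Label::Unknown;
};
```

For example, with hypothetical thresholds CarThresholds{5000.0, 2.0, 2000.0, 1.2} and limits of 2 and -2, a track needs three consecutive car-like blobs before it is labeled a car, and a single human-like blob afterwards does not flip the label.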


3.  IMPLEMENTATION  OF  OUR  APPROACH 

We implemented tracking in C++ using the OpenCV library for real-time computer vision. The ensemble in our case consisted of three
individual algorithms: Gaussian Mixture Background Subtraction with mean-shift, Lucas-Kanade optical flow, and the color histogram model
with mean-shift. The selection of k1 to k6 is explained above in section 2.3, and may vary based on known tracking algorithm characteristics.
The switching algorithm is called by the agile tracker every frame.

4.  RESULTS  AND  COMPARATIVE  STUDIES 

We compare the results from our tracker against seven existing trackers whose outputs are available with the public TLD dataset [12].
Table 1 shows the number of frames after which the trackers lost track for the first time. Table 2 gives the number of frames up to the first track
loss for the TUD dataset [21]. The measure proves to be effective in the absence of a track merging algorithm. The agile tracker performs
well in most of the cases. Fig 3 shows the outputs of the agile tracker on the TLD dataset. Also, TLD is based on template
matching and hence fails for videos with multiple similar-looking objects. This is illustrated in Fig 4, where TLD switches tracks
arbitrarily between similar-looking foreground objects whereas the agile tracker keeps tracking a particular object for the entire time frame of
its visibility. The full-length tracked videos along with further results on VIRAT data are available at [15].

We  also  compare  our  tracker  against  the  TUD  Pedestrian  Detector  for  multi-object  tracking.  A  measure  of  the  total  number  of  correct  and  false 
object  detections  is  used. 


Algorithm                                           Jumping   Car     Motocross   Car chase   Panda
Total number of frames                              313       945     2665        9928        3000

Beyond semi-supervised tracking [12]                14        28      6           66          130
Co-trained Generative-Discriminative tracking [12]  11        34      1           1           1
“CVPR” results as given in [12]                     96        29      59          334         358
Online Multiple Instance Learning [12]              313       220     63          321         992
On-line Boosting [12]                               26        545     15          316         1004
Semi-Supervised On-line Boosting [12]               21        652     59          190         83
TLD [12]                                            313       802     173         244         277
Agile Tracker                                       313       581     110         402         2568

Table  1.  Comparison  of  the  various  single-object  trackers 


                                Campus            Crossing
                                Correct (False)   Correct (False)
Expected Detections             303               1008
TUD Pedestrian Detector [21]    227 (0)           692 (7)
Agile Tracker                   222 (0)           541 (28)

Table  2.  Comparison  of  the  multi-object  trackers 


Figure  3:  Results  from  the  agile  tracker 


Figure  4:  The  top  one  represents  the  output  from  the  agile  tracker  and  the  bottom  one  represents  that  from  TLD. 


Figure  5:  Agile  Tracker  results  for  the  TUD  campus  and  crossing  videos 


5.  CONCLUSIONS 

Our novel algorithm for starting tracks using a confidence measure and adaptive thresholding not only performs in real time but is also accurate.
The agile tracking framework allows dynamic adaptive switching within an ensemble of tracking algorithms based on a performance measure
while preserving state, providing more accuracy than any of the individual algorithms. We believe that the presented framework provides the
foundation for real-time video activity recognition.